Optimal Parallel Algorithms for Computing the Sum, the Prefix-Sums, and the Summed Area Table on the Memory Machine Models
Author
Abstract
The main contribution of this paper is to show optimal parallel algorithms for computing the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number $p$ of threads, the width $w$ of the memory, and the memory access latency $l$. We first show that the sum of $n$ numbers can be computed in $O\left(\frac{n}{w} + \frac{nl}{p} + l\log n\right)$ time units on the DMM and the UMM. We then show that $\Omega\left(\frac{n}{w} + \frac{nl}{p} + l\log n\right)$ time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of $n$ numbers in $O\left(\frac{n}{w} + \frac{nl}{p} + l\log n\right)$ time units on the DMM and the UMM. Finally, we show that the summed area table of size $\sqrt{n} \times \sqrt{n}$ can be computed in $O\left(\frac{n}{w} + \frac{nl}{p} + l\log n\right)$ time units on the DMM and the UMM. Since computing the prefix-sums and the summed area table is at least as hard as computing the sum, these parallel algorithms are also optimal.
Key words: memory machine models, prefix-sums computation, parallel algorithm, GPU, CUDA
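To make the three problems named in the abstract concrete, here is a minimal sequential Python sketch of what each one computes, plus a helper that evaluates the expression inside the stated bound. It is not the paper's DMM/UMM algorithm and does not simulate either model. All function names (`tree_sum`, `prefix_sums`, `summed_area_table`, `cost_bound`) are hypothetical, and `cost_bound` ignores the constant factors hidden by the O-notation and assumes a base-2 logarithm.

```python
from itertools import accumulate
from math import log2


def tree_sum(values):
    """Sum by pairwise reduction: about log2(n) rounds, and the additions
    within one round are independent (run sequentially here)."""
    vals = list(values)
    if not vals:
        return 0
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        vals = [
            vals[i] + (vals[i + half] if i + half < len(vals) else 0)
            for i in range(half)
        ]
    return vals[0]


def prefix_sums(values):
    """Inclusive prefix-sums: out[i] = values[0] + ... + values[i]."""
    return list(accumulate(values))


def summed_area_table(matrix):
    """Summed area table: out[i][j] is the sum of matrix[r][c] over all
    r <= i and c <= j, obtained by row-wise then column-wise prefix-sums."""
    rows = [prefix_sums(row) for row in matrix]
    cols = [prefix_sums(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def cost_bound(n, p, w, l):
    """Evaluate n/w + n*l/p + l*log2(n), the expression inside the bound
    (constant factors ignored; base-2 logarithm is an assumption)."""
    return n / w + n * l / p + l * log2(n)


if __name__ == "__main__":
    print(tree_sum([1, 2, 3, 4, 5]))
    print(prefix_sums([1, 2, 3, 4]))
    print(summed_area_table([[1, 2], [3, 4]]))
    print(cost_bound(n=1024, p=1024, w=32, l=100))
```

The helper makes it easy to see which term dominates for a given choice of parameters: with few threads the $nl/p$ term grows, while with many threads and high latency the $l\log n$ term takes over.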
Similar resources
A High Performance Parallel IP Lookup Technique Using Distributed Memory Organization and ISCB-Tree Data Structure
The IP Lookup Process is a key bottleneck in routing due to the increase in routing table size, increasing traffic and migration to IPv6 addresses. The IP address lookup involves computation of the Longest Prefix Matching (LPM), which existing solutions such as BSD Radix Tries, scale poorly when traffic in the router increases or when employed for IPv6 address lookups. In this paper, we describe a ...
Program-Centric Cost Models for Locality and Parallelism
Good locality is critical for the scalability of parallel computations. Many cost models that quantify locality and parallelism of a computation with respect to specific machine models have been proposed. A significant drawback of these machine-centric cost models is their lack of portability. Since the design and analysis of good algorithms in most machine-centric cost models is a non-trivial t...
Foundational Algorithms for Distributed Robot Swarms
In this paper, we study discrete swarm algorithms, where mobile robots (or “mobots”) move around interacting in an environment to solve computational problems. This work extends recent work on swarm algorithms in the distributed computing, artificial intelligence, and robotics literatures in that we allow for mobots to have additional memory so as to enable computations that are somewhat more s...
Paper Title
Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA imple...
Journal:
- IEICE Transactions
Volume 96-D, Issue
Pages -
Publication date: 2013